• 1 Introduction
  • 2 SMART Problem
    • 2.0.1 Why did we choose this topic?
    • 2.0.2 What prior research and analysis have been done on this topic?
    • 2.0.3 Your SMART questions, and how did they come up?
    • 2.0.4 After the EDA, did your questions change? If so, how?
  • 3 Cross-Sell RAW Data - Description
  • 4 Exploratory Data Analysis
    • 4.1 Response Variable Trend
    • 4.2 TREND OF ALL FEATURES IN DATASET WITH RESPONSE
      • 4.2.1 Age v/s Response
      • 4.2.2 Gender v/s Response
      • 4.2.3 Vehicle Damage v/s Response
      • 4.2.4 Region Code v/s Response
      • 4.2.5 Driving License v/s Response
      • 4.2.6 Previously Insured v/s Response
      • 4.2.7 Vehicle Age v/s Response
      • 4.2.8 Policy Sales Channel v/s Response
      • 4.2.9 Annual Premium v/s Response
      • 4.2.10 Vintage v/s Response
  • 5 INFERENTIAL STATISTICS - HYPOTHESIS TESTING
    • 5.1 Removing Outliers
    • 5.2 Converting categorical variables into factors
    • 5.3 t-Test for numerical variables
      • 5.3.1 t-Test for Age
      • 5.3.2 t-Test for Policy Sales Channel
      • 5.3.3 t-Test for Annual Premium
      • 5.3.4 t-Test for Vintage
    • 5.4 χ² test for categorical variables
      • 5.4.1 χ² test for Gender
      • 5.4.2 χ² test for Driving License
      • 5.4.3 χ² test for Region Code
      • 5.4.4 χ² test for Previously Insured
      • 5.4.5 χ² test for Vehicle Age
      • 5.4.6 χ² test for Vehicle Damage
  • 6 Conclusion
    • 6.1 Establishing correlation

1 Introduction

What is Cross-Sell?

Cross-selling in insurance is the act of promoting products that are related or complementary to the one(s) your current customers already own or use. It is one of the most effective methods of marketing.

Client Profile:

An insurance company that provides medical insurance to its customers wants to know how many of its existing policyholders (customers) from last year will also be interested in Vehicle Insurance provided by the company.

  1. What is an Insurance Policy? An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

  2. What is Vehicle Insurance? Vehicle insurance is insurance for cars, trucks, motorcycles, and other road vehicles. Every year the customer pays a premium of a certain amount to the insurance provider, which in return provides financial protection against physical damage or bodily injury resulting from traffic collisions and against liability that could also arise from incidents in a vehicle.

2 SMART Problem

Knowing whether a customer would be interested in an additional insurance service like Vehicle Insurance is extremely helpful for the company, because it can then plan its communication strategy to reach out to those customers and optimize its business model and revenue. We have the following information to assist our analysis: demographics (gender, age, region code type), vehicles (Vehicle Age, Damage), policy (Premium, sourcing channel), etc.

2.0.1 Why did we choose this topic?

Insurance was a familiar field to our team, and cross-selling is a widely used strategy in the insurance market. Hence, we decided to pursue cross-sell analytics.

2.0.2 What prior research and analysis have been done on this topic?

To understand cross-selling in more detail, we looked for useful resources and came across the following links:

  1. https://www.yieldify.com/blog/cross-selling/
  2. https://www.podium.com/article/cross-selling/
  3. https://www.business.com/articles/how-to-boost-sales-with-cross-selling-and-cross-promotion/

2.0.3 Your SMART questions, and how did they come up?

After studying the data set, we identified which attributes appear to contribute to the response of the customer. We then formulated our SMART question and sub-SMART questions around “Response”.

2.0.4 After the EDA, did your questions change? If so, how?

Our SMART questions didn’t change, but we modified them a little to better reflect the impact of the independent attributes on the “Response” of the customer.

This report is organized as follows:

  1. Description of the Data (explanation of the dataset and its variables)
  2. EDA - Target variable and Independent variable
  3. Hypothesis Testing: t-Test & Chi-Square
  4. Conclusion

3 Cross-Sell RAW Data - Description

Our dataset houses 381109 observations across 12 variables. (See below for a readout of the dataset’s structure and variable names.) Variable descriptions come from the following link; an asterisk next to a variable name indicates usage in our analysis.

dataset

For our exploratory data analysis, we ignored “id” because it is an identifier with no relation to the customer’s “Response”.

4 Exploratory Data Analysis

The percentiles give an idea of the outliers: in Annual_Premium the 3rd quartile is 39400 while the maximum is 540165, which points to outliers in this column.
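
A minimal sketch of this kind of percentile check, assuming a toy vector rather than the real data: only the 3rd quartile (39400) and the maximum (540165) are taken from the text, and the remaining values and the names `premium` and `gap_in_iqrs` are hypothetical.

```r
# Hypothetical toy premiums; only 39400 (3rd quartile) and 540165 (max)
# come from the report, the other values are made up for illustration.
premium <- c(2630, 24400, 31700, 39400, 39400, 540165)

# Minimum, quartiles, median, mean and maximum in one call.
print(summary(premium))

# How far the maximum sits above the 3rd quartile, measured in IQRs.
q3 <- unname(quantile(premium, 0.75))
gap_in_iqrs <- (max(premium) - q3) / IQR(premium)
print(gap_in_iqrs)
```

A maximum that lies many interquartile ranges above the 3rd quartile is the pattern described above.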

4.1 Response Variable Trend

The plot shows that the response is imbalanced: only 12.6% of individuals are interested in purchasing vehicle insurance.
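
The class shares behind such a plot can be tabulated directly. The sketch below uses a hypothetical toy vector built to reproduce a 12.6% share, and it assumes that interest is coded as 1 and no interest as 0.

```r
# Hypothetical toy response vector (assumed coding: 1 = interested, 0 = not).
response <- c(rep(1, 126), rep(0, 874))

counts <- table(response)        # rows per class
shares <- prop.table(counts)     # fraction of rows per class
print(round(100 * shares, 1))    # shares in percent
```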

4.2 TREND OF ALL FEATURES IN DATASET WITH RESPONSE

4.2.1 Age v/s Response

Age looks right-skewed, with the highest count at age 25. Age is important because the median differs between customers who accept and those who reject, as the box plot shows. Customers who take the insurance tend to be older than those who do not.
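
The median comparison read off a box plot can also be computed numerically. This is a sketch on hypothetical toy data; the column names `Age` and `Response` follow the dataset, and the 1/0 coding of `Response` is an assumption.

```r
# Hypothetical toy data; Response is assumed to be coded 1 (accept) / 0 (reject).
toy <- data.frame(
  Age      = c(23, 25, 25, 31, 44, 52, 39, 46, 48, 60),
  Response = c(0,  0,  0,  0,  0,  0,  1,  1,  1,  1)
)

# Median Age within each Response group, i.e. the centre line of each box.
medians <- tapply(toy$Age, toy$Response, median)
print(medians)

# The same comparison as a plot (uncomment in an interactive session):
# boxplot(Age ~ Response, data = toy)
```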

4.2.2 Gender v/s Response

The Male category is slightly larger than the Female category, and male customers are also slightly more likely to buy the insurance.

4.2.3 Vehicle Damage v/s Response

Customers with and without vehicle damage are almost equally represented. Those with vehicle damage are more interested in vehicle insurance.
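
For categorical features such as this one, the share of interested customers per category can be read from a row-normalised contingency table. The following is a sketch on hypothetical toy data; the "Yes"/"No" labels and the 1/0 coding of `Response` are assumptions.

```r
# Hypothetical toy data (assumed labels and coding).
toy <- data.frame(
  Vehicle_Damage = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
  Response       = c(1, 1, 0, 0, 0, 0, 0, 1)
)

ct <- table(toy$Vehicle_Damage, toy$Response)  # counts per category and response
rates <- prop.table(ct, margin = 1)            # normalise so each row sums to 1
print(rates)
```

The column for response 1 then gives the response rate within each category, which is easier to compare across categories than raw counts.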

4.2.4 Region Code v/s Response

Region Code 28 seems to have the most customers and also the most customers interested in vehicle insurance.

4.2.5 Driving License v/s Response

About 99% of customers have a driving license, and the customers interested in Vehicle Insurance hold a driving license.

4.2.6 Previously Insured v/s Response

Customers who do not already have insurance outnumber those who do, and they are also more likely to buy the insurance.

4.2.7 Vehicle Age v/s Response

Customers with a vehicle older than 2 years are few, but some of them are interested in getting vehicle insurance. Most of the interested customers have a vehicle that is 1-2 years old.

4.2.8 Policy Sales Channel v/s Response

4.2.9 Annual Premium v/s Response

The comparison plot above shows a lot of outliers in Annual Premium. We remove them and see what trend emerges. The variable is also not normally distributed.

As observed in the box plot, the medians are slightly different, so this variable is likely to be useful for the model.

4.2.10 Vintage v/s Response

Looking at the box plot, we can see that the medians are almost at the same level; the variable is therefore not helpful, because it does not discriminate between accepting and rejecting.

5 INFERENTIAL STATISTICS - HYPOTHESIS TESTING

5.1 Removing Outliers

Applying a log transform to Annual_Premium smooths its distribution, which makes it easier to visualize. The box plot is useful because it shows that the medians of the two groups are not at the same level, so the variable discriminates between profiles. It is possible to infer that people with a greater Annual_Premium take the insurance.
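
The report does not state which rule was used to drop outliers. The sketch below assumes the common 1.5 × IQR fence, applied to a hypothetical toy vector, and then takes the log of what remains; the names `premium`, `trimmed` and `log_premium` are illustrative only.

```r
# Hypothetical toy premiums (not the real data).
premium <- c(2630, 24400, 31700, 39400, 39400, 540165)

# Assumed rule: keep values within 1.5 * IQR of the quartiles.
q     <- unname(quantile(premium, c(0.25, 0.75)))
fence <- 1.5 * IQR(premium)
keep  <- premium >= q[1] - fence & premium <= q[2] + fence

trimmed     <- premium[keep]   # outliers removed
log_premium <- log(trimmed)    # natural log, used for smoother plots
print(trimmed)
print(round(log_premium, 2))
```

On this toy vector the fence removes both the very small and the very large value, so the choice of rule should be checked against the real distribution before it is applied.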

5.2 Converting categorical variables into factors
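
A minimal sketch of such a conversion, assuming a toy data frame: the column names follow the dataset's naming, while the values and the helper name `cat_cols` are hypothetical.

```r
# Hypothetical toy data frame; values are illustrative only.
toy <- data.frame(
  Gender             = c("Male", "Female", "Male"),
  Vehicle_Age        = c("< 1 Year", "> 2 Years", "1-2 Year"),
  Previously_Insured = c(0, 1, 0),
  stringsAsFactors   = FALSE
)

# Convert the listed columns to factors in one step.
cat_cols <- c("Gender", "Vehicle_Age", "Previously_Insured")
toy[cat_cols] <- lapply(toy[cat_cols], factor)

str(toy)  # each converted column now reports "Factor w/ ... levels"
```

Storing these columns as factors makes R treat them as categories, not numbers, in tables, plots and tests.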

5.3 t-Test for numerical variables

A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. It is one of many tests used for hypothesis testing in statistics. Calculating a t-test requires three key data values: the difference between the mean values of the two data sets (called the mean difference), the standard deviation of each group, and the number of data values in each group. For the t-test, we split the original dataset “vehicle” into subsets of customers who “accepted” or “rejected” the insurance.
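
The three ingredients named above can be combined by hand. The sketch below assumes the unequal-variance (Welch) form of the two-sample statistic, a hypothetical toy `vehicle` data frame, and a 1/0 coding of `Response`; it then checks the hand calculation against `t.test()`.

```r
# Hypothetical toy data; Response is assumed to be coded 1 (accepted) / 0 (rejected).
vehicle <- data.frame(
  Age      = c(41, 46, 48, 52, 44, 23, 25, 25, 31, 28, 27),
  Response = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# Split into the two sub-groups.
accepted <- subset(vehicle, Response == 1)
rejected <- subset(vehicle, Response == 0)

# Mean difference, group variances (sd squared) and group sizes.
mean_diff <- mean(accepted$Age) - mean(rejected$Age)
se <- sqrt(var(accepted$Age) / nrow(accepted) +
           var(rejected$Age) / nrow(rejected))
t_by_hand <- mean_diff / se

# Two-sample t.test() uses the same unequal-variance form by default.
t_builtin <- unname(t.test(accepted$Age, rejected$Age)$statistic)
print(c(by_hand = t_by_hand, builtin = t_builtin))
```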

5.3.1 t-Test for Age

## 
##  One Sample t-test
## 
## data:  accepted$Age
## t = 759, df = 45154, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  43.2 43.4
## sample estimates:
## mean of x 
##      43.3
## 
##  One Sample t-test
## 
## data:  rejected$Age
## t = 1379, df = 3e+05, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  38.0 38.1
## sample estimates:
## mean of x 
##        38

5.3.2 t-Test for Policy Sales Channel

## 
##  One Sample t-test
## 
## data:  accepted$Policy_Sales_Channel
## t = 352, df = 45154, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  92.2 93.2
## sample estimates:
## mean of x 
##      92.7
## 
##  One Sample t-test
## 
## data:  rejected$Policy_Sales_Channel
## t = 1237, df = 3e+05, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  115 115
## sample estimates:
## mean of x 
##       115

5.3.3 t-Test for Annual Premium

## 
##  One Sample t-test
## 
## data:  accepted$Annual_Premium
## t = 410, df = 45154, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  29856 30143
## sample estimates:
## mean of x 
##     30000
## 
##  One Sample t-test
## 
## data:  rejected$Annual_Premium
## t = 1138, df = 3e+05, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  29112 29213
## sample estimates:
## mean of x 
##     29163

5.3.4 t-Test for Vintage

## 
##  One Sample t-test
## 
## data:  accepted$Vintage
## t = 391, df = 45154, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  153 155
## sample estimates:
## mean of x 
##       154
## 
##  One Sample t-test
## 
## data:  rejected$Vintage
## t = 1053, df = 3e+05, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  154 155
## sample estimates:
## mean of x 
##       154

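Each test above is a one-sample t-test of a sub-group mean against the default null value of 0 (“true mean is not equal to 0”). The p-values for all numerical variables, in both the accepted and rejected sub-groups, are less than alpha (0.05), so the null hypothesis that the mean equals 0 can be rejected in every case. These tests do not by themselves compare the two sub-groups. However, the 95 percent confidence intervals of the two sub-groups do not overlap for Age (43.2 to 43.4 versus 38.0 to 38.1), Policy Sales Channel (92.2 to 93.2 versus 115) and Annual Premium (29856 to 30143 versus 29112 to 29213), which suggests that the means of accepted and rejected differ for these variables. For Vintage the intervals overlap (153 to 155 versus 154 to 155), consistent with the box plot in section 4.2.10.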
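
A direct comparison of the accepted and rejected means is a two-sample test, which the outputs above do not show. As a hedged sketch of how that could look, assuming a hypothetical toy `vehicle` data frame with `Response` stored as a two-level factor:

```r
# Hypothetical toy data (not the real dataset).
vehicle <- data.frame(
  Age      = c(41, 46, 48, 52, 44, 23, 25, 25, 31, 28, 27),
  Response = factor(c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0))
)

# Welch two-sample t-test: null hypothesis is that both group means are equal.
res <- t.test(Age ~ Response, data = vehicle)
print(res)

alpha <- 0.05
if (res$p.value < alpha) {
  print("Mean Age differs between accepted and rejected")
} else {
  print("No significant difference in mean Age")
}
```

Here the alternative hypothesis printed by R reads "true difference in means ... is not equal to 0", which is the statement the comparison between sub-groups needs.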

5.4 χ² test for categorical variables

We use the Chi-square (χ²) test of independence to check whether each categorical variable (Gender, Driving License, Region Code, Previously Insured, Vehicle Age and Vehicle Damage) is related to Response. The null hypothesis is that the variable and Response are independent. If the p-value is less than 0.05, which is our alpha, we reject the null hypothesis and conclude that the variables are not independent, so the variable is statistically significant for our model.
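
A minimal sketch of this decision rule on a hypothetical 2 × 2 table: the counts are made up, the name `ct` mirrors the `data: ct` line in the outputs below, and the message wording is an assumption modelled on those outputs.

```r
# Hypothetical contingency table of Gender by Response (made-up counts).
ct <- matrix(c(40, 60,
               20, 80),
             nrow = 2, byrow = TRUE,
             dimnames = list(Gender   = c("Male", "Female"),
                             Response = c("1", "0")))

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default.
res <- chisq.test(ct)
print(res)

alpha <- 0.05
if (res$p.value < alpha) {
  print("Gender is not independent of response")      # null hypothesis rejected
} else {
  print("No evidence that Gender and response are dependent")
}
```

Tables with more than two rows or columns, such as Vehicle Age, are tested without the continuity correction, which is why some of the headers below omit it.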

5.4.1 χ² test for Gender

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ct
## X-squared = 1014, df = 1, p-value <2e-16
## [1] "Alpha value is set as 0.05 and p -value from Pearson's test is: 1.54578483273509e-222"
## [1] "Gender is not independent of response"

5.4.2 χ² test for Driving License

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ct
## X-squared = 34, df = 1, p-value = 6e-09
## [1] "Alpha value is set as 0.05 and p -value from Pearson's test is: 6.29491414277065e-09"
## [1] "Driving License is not independent of response"

5.4.3 χ² test for Region Code

## 
##  Pearson's Chi-squared test
## 
## data:  ct
## X-squared = 2958, df = 4, p-value <2e-16
## [1] "Alpha value is set as 0.05 and p -value from Pearson's test is: 0"
## [1] "Region Code is not independent of response"

5.4.4 χ² test for Previously Insured

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ct
## X-squared = 43092, df = 1, p-value <2e-16
## [1] "Alpha value is set as 0.05 and p -value from Pearson's test is: 0"
## [1] "Previously Insured is not independent of response"

5.4.5 χ² test for Vehicle Age

## 
##  Pearson's Chi-squared test
## 
## data:  ct
## X-squared = 18063, df = 2, p-value <2e-16
## [1] "Alpha value is set as 0.05 and p -value from Pearson's test is: 0"
## [1] "Vehicle Age is not independent of response"

5.4.6 χ² test for Vehicle Damage

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ct
## X-squared = 46489, df = 1, p-value <2e-16
## [1] "Alpha value is set as 0.05 and p -value from Pearson's test is: 0"
## [1] "Vehicle Damage is not independent of response"

6 Conclusion

6.1 Establishing correlation

After looking at our hypothesis tests, we can conclude that the null hypothesis can be rejected. This means that the numerical attributes have a statistically significant relationship with our dependent variable, Response, and need to be analysed further. After the tests, we computed correlations to understand which variables are “more” significant in impacting “Response”, and we can conclude that vehicle_damage, previously_insured and vehicle_age have a high correlation with it.
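
The report does not say how the correlations were computed. One possible approach, sketched here under the assumption that the categorical variables are first encoded as 0/1 numbers and that Pearson correlation is used, on hypothetical toy data:

```r
# Hypothetical toy data; the encoding and values are assumptions.
toy <- data.frame(
  Vehicle_Damage     = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
  Previously_Insured = c(0, 0, 0, 1, 1, 1, 1, 0),
  Response           = c(1, 1, 0, 1, 0, 0, 0, 0)
)

# Encode the Yes/No column as 1/0 so that cor() can use it.
toy$Vehicle_Damage <- as.numeric(toy$Vehicle_Damage == "Yes")

# Pearson correlation of each feature with Response.
cors <- cor(toy[c("Vehicle_Damage", "Previously_Insured")], toy$Response)
print(round(cors, 2))
```

For two 0/1 variables the Pearson correlation equals the phi coefficient, so its sign shows the direction of the association and its magnitude shows the strength.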